12 research outputs found

    Sõnatähenduste normimise traditsioon ja selle murdmine eesti keelekorralduses

    Abstract. Lydia Risberg, Margit Langemets: Breaking the tradition of standardizing word meanings in Estonian language planning. The Estonian language-planning tradition of standardizing the meanings of general-language words through the Dictionary of Standard Estonian (Estonian ÕS; DSE) has long shaped society’s understanding of how (the Estonian) language works. Since the 1980s, language planning has taken a more “lenient” approach: instead of standardizing meanings, it has considered certain meanings inappropriate for the standard language and made recommendations on them (e.g. in the DSE 1999–2018). In this article, we give an overview of the history of the standardization of word meanings in Estonian language planning since the second half of the 19th century, analyzing the most important publications related to it. We draw generalizations and conclusions about the standardization of meanings based on our earlier research on word meanings: as usage-based linguistics and corpus linguistics have taken deeper root in Estonian linguistics, language planning has also come to understand that meanings do not remain fixed in the dictionary. Whether a (new) meaning survives in use depends on language-internal and language-external factors, not on the DSE.

    State-of-the-art on monolingual lexicography for Estonia

    The paper describes the state of the art of monolingual lexicography in Estonia. Firstly, we describe the current situation in Estonia and the main public functions performed by the Institute of the Estonian Language. Secondly, we provide an overview of the primary types of monolingual academic dictionaries (dictionaries of Standard Estonian and explanatory dictionaries) published in Estonia since the 20th century. Monolingual learner’s lexicography emerged as a new field in the 2010s, focusing on basic vocabulary and collocations. Thirdly, we give a short overview of accessibility policy and the availability of language resources for Estonian. Finally, we outline future work in the field of lexicography at the Institute. Within the framework of the new dictionary writing system Ekilex, the Institute is moving away from presenting separate interfaces for different dictionaries towards a unified data model, in order to provide the data in aggregated form.

    D3.8 Lexical-semantic analytics for NLP

    UIDB/03213/2020 UIDP/03213/2020. The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, which focuses on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, and at investigating their role in NLP applications. Specifically, the task concentrates on three research directions: i) sense clustering, in which grouping senses by semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD); ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved; and iii) analysing the diachronic distribution of senses, for which a software package is made available.

    Süntaktiline info sõnastikus: probleeme ja väljavaateid


    An insight into lexicographic practices in Europe Results of the extended ELEXIS Survey on User Needs

    The paper presents the results of a survey on lexicographic practices and lexicographers’ needs across Europe that was conducted in the context of the Horizon 2020 project European Lexicographic Infrastructure (ELEXIS) among the observer institutions of the project. The survey is a revised and upgraded version of the survey which was originally conducted among ELEXIS lexicographic partner institutions in 2018 (Kallas et al. 2019a). The main goal of this new survey was to complement the data from the ELEXIS lexicographic partner institutions in order to get a more complete picture of lexicographic practices for both born-digital and retro-digitised resources in Europe. The results offer a detailed insight into many aspects of the lexicographic process at European institutions, such as funding, training, staff, lexicographic expertise, software and tools. In addition, the survey reflects on current trends in lexicography and reveals what institutions see as the most important emerging trends that will affect lexicography in the short-term and long-term future. Overall, the results provide valuable input informing the development of tools, resources, guidelines and training materials within ELEXIS.

    The EKI Combined Dictionary 2022 (ELEXIS)

    Eesti Keele Ühendsõnastik 2022 (EKI Combined Dictionary 2022) displays information from different lexical databases: "The Dictionary of Estonian 2019", "Estonian Collocations Dictionary 2019", "Basic Estonian Dictionary" (2014), "The Estonian Morphological Database of the Institute of the Estonian Language 2022". It also displays information from bilingual lexical databases: "Estonian-Russian orthographic dictionary for students 2018" (1st edition 2011), "Estonian-Russian Dictionary 2018" (1st edition 1997–2009), "The Russian Morphological Database of the Institute of the Estonian Language 2022". The data is stored in Ekilex's PostgreSQL database and is accessible through an API. Ekilex is the in-house dictionary writing system (DWS) of the Institute of the Estonian Language. Ekilex is hosted in the Estonian Scientific Computing Infrastructure (ETAIS) cloud. See also: https://doi.org/10.15155/3-00-0000-0000-0000-08C0A

    Parallel sense-annotated corpus ELEXIS-WSD 1.0

    ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.

    List of sense inventories
    BG: Dictionary of Bulgarian
    DA: DanNet – The Danish WordNet
    EN: Open English WordNet
    ES: Spanish Wiktionary
    ET: The EKI Combined Dictionary of Estonian
    HU: The Explanatory Dictionary of the Hungarian Language
    IT: PSC + Italian WordNet
    NL: Open Dutch WordNet
    PT: Portuguese Academy Dictionary (DACL)
    SL: Digital Dictionary Database of Slovene

    The corpus is available in a CONLL-like tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression). Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt
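    The seven-column layout described above can be read with a few lines of Python. This is only a minimal sketch, not the corpus authors' tooling: it assumes that sentences are separated by blank lines, that lines starting with "#" are comments, and that empty cells are written as "_", none of which is stated in the abstract (00README.txt is the authoritative source). The Token class, the read_sentences function and every value in SAMPLE, including the whitespace flags and sense IDs, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    """One row of the tab-separated format; all fields are kept as raw strings."""
    token_id: str
    form: str
    lemma: str
    upos: str
    whitespace: str   # whether a whitespace follows the token (encoding assumed)
    sense_id: str     # ID of the assigned sense, "_" assumed to mean "none"
    mwe_index: str    # multiword-expression index, "_" assumed to mean "none"


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of Token objects.

    Assumption: a blank line ends a sentence and "#" starts a comment line.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
        sentence.append(Token(*fields))
    if sentence:
        yield sentence


# Hypothetical sample data; not taken from the corpus.
SAMPLE = "\n".join([
    "# hypothetical sentence 1",
    "1\tThe\tthe\tDET\tyes\t_\t_",
    "2\tbank\tbank\tNOUN\tyes\tsense-001\t_",
    "3\tclosed\tclose\tVERB\tno\tsense-002\t_",
    "4\t.\t.\tPUNCT\tno\t_\t_",
    "",
    "1\tHello\thello\tINTJ\tno\t_\t_",
])

sentences = list(read_sentences(SAMPLE.splitlines()))
annotated = [(t.form, t.sense_id) for t in sentences[0] if t.sense_id != "_"]
```

    Keeping every column as a string avoids guessing how the real files encode the whitespace flag or the sense IDs; those can be converted once the README has been checked.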

    A multilingual evaluation dataset for monolingual word sense alignment

    Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA. The authors would like to thank the three anonymous reviewers for their insightful suggestions and careful reading of the manuscript. This work has received funding from the EU’s Horizon 2020 Research and Innovation programme through the ELEXIS project under grant agreement No. 731015. The contributions in Bulgarian were partially funded by the Bulgarian National Interdisciplinary Research e-Infrastructure for Resources and Technologies in favor of the Bulgarian Language and Cultural Heritage, part of the EU infrastructures CLARIN and DARIAH – CLaDA-BG, Grant number DO1-272/16.12.2019. This work is also supported by Science Foundation Ireland (SFI) under the Insight Center for Data Analytics (Grant Number SFI/12/RC/2289) and the Irish Research Council under the “Cardamom” Consolidator Laureate Grant (IRCLA/2017/129).